Add RISCV64 RVV backend #1037
Conversation
@mjosaarinen If you have
Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so it might provide some inspiration for a fully vectorized, hand-written back-end.
Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at that when considering RISC-V ISA extensions (see slide 17, for example, in https://mjos.fi/doc/20240325-rwc-riscv.pdf ). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)
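As a rough illustration of the kind of loop that autovectorizes well (this is not the mlkem-native code; the function name, the plain `%`-based reduction, and the fixed modulus are all illustrative), a single Cooley-Tukey butterfly layer written as independent lane operations is exactly the pattern GCC/Clang at `-O3 -march=rv64gcv` tend to turn into clean RVV code:

```c
#include <stdint.h>
#include <assert.h>

#define Q 3329 /* the ML-KEM modulus */

/* One Cooley-Tukey butterfly layer over a block of 2*len coefficients.
 * Every iteration is independent of the others, so the compiler can
 * vectorize the loop without any dependence analysis heroics. */
static void butterfly_layer(int16_t *r, int16_t zeta, int len)
{
  for (int j = 0; j < len; j++)
  {
    int16_t t = (int16_t)(((int32_t)zeta * r[j + len]) % Q);
    int16_t a = r[j];
    r[j] = (int16_t)((a + t) % Q);
    r[j + len] = (int16_t)((a - t + Q) % Q); /* keep result in [0, Q) */
  }
}
```

A layer-merged NTT chains several such layers inside one loop nest so the coefficients stay in registers between layers, which is what makes the "fastntt3" variants so amenable to autovectorization.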
Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive), I could have just read the documentation. I'll set up this nix thing.
@mjosaarinen Sorry, we should have pointed that out earlier. With the
@mkannwischer @mjosaarinen The RV functional tests in the ci-cross shell don't seem to pass:

```shell
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
% EXEC_WRAPPER=qemu-riscv64 make run_func_512 -j32 OPT=1 AUTO=1
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:53: run_func_512] Error 1
```

Same for
Could you investigate?
The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.
It looks like some things are running fine, and some things have build problems. I suspect that the platform auto-detection code (in the C header files + makefiles) is to blame (does failing no_opt mean "no optimization"?). It's quite a complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?
Sure. If you have nix installed, then it's exactly what I posted above: you open the nix shell with

```shell
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
```

Building using the cross compiler:
And then running:
The base command line used by
What is confusing to me is that this happens regardless of whether
If you don't want to be bothered by auto-settings, we can for now set
It looks like the issue is solely due to
it'll fail with
Probably some configuration is missing in the invocation of
Hmm. I'm not sure if qemu is able to pull the standard runtime and dynamic libraries that match the "vector ABI" (the calling convention changes somewhat with vector registers). I had to link the test programs with
Anyway, the code assumes the "vector 1.0" extension, 256-bit VLEN, and the standard "gc" stuff, so that looks correct. I left the Keccak instruction stuff out.
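For reference, user-mode QEMU exposes the vector configuration as CPU properties. A sketch of pinning V 1.0 with a 256-bit VLEN (treat the exact property spellings as an assumption to verify against your QEMU version, as they have shifted across releases):

```shell
# Run a test binary with the V extension enabled and VLEN fixed to 256 bits.
qemu-riscv64 -cpu rv64,v=true,vlen=256,elen=64 \
  test/build/mlkem512/bin/test_mlkem512

# Equivalently via the QEMU_CPU environment variable, useful when qemu
# is invoked indirectly (e.g. through EXEC_WRAPPER in the build system):
QEMU_CPU=rv64,v=true,vlen=256 qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
```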
I adjusted the CI. Most tests pass, but there are some issues in the ACVP tests and the monobuild examples. The ACVP one seems to be an issue in the test script (not expecting parameters in the exec-wrapper), while the monobuild failure seems to be an architecture confusion. I'll take a look.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Intel Xeon 3rd gen (c6i)'.
Benchmark result of this commit is worse than the previous benchmark result, exceeding threshold 1.03.

| Benchmark suite | Current: 01bab43 | Previous: 5aaca94 | Ratio |
|---|---|---|---|
| ML-KEM-768 encaps | 31969 cycles | 30015 cycles | 1.07 |

This comment was automatically generated by workflow using github-action-benchmark.
I was wrong. I tried out what seemed like the straightforward mulcache implementation to me:

```c
void mlk_rv64v_poly_mulcache_compute(int16_t x[MLKEM_N / 2],
                                     const int16_t y[MLKEM_N])
{
#include "rv64v_zetas_basemul.inc"
  size_t vl = __riscv_vsetvl_e16m1(MLKEM_N) / 2;
  size_t i;
  for (i = 0; i < MLKEM_N; i += 2 * vl)
  {
    vint16m1_t b1 =
        __riscv_vget_v_i16m1x2_i16m1(__riscv_vlseg2e16_v_i16m1x2(&y[i], vl), 1);
    vint16m1_t z = __riscv_vle16_v_i16m1(&roots[i / 2], vl);
    vint16m1_t bt = fq_mul_vv(b1, z, vl);
    __riscv_vse16_v_i16m1(&x[i / 2], bt, vl);
  }
}

static inline void mlk_rv64v_poly_basemul_mont_add_k(int16_t *r,
                                                     const int16_t *a,
                                                     const int16_t *b,
                                                     const int16_t *bc,
                                                     unsigned kn)
{
  size_t vl = __riscv_vsetvl_e16m1(MLKEM_N) / 2;
  size_t i, j;
  for (i = 0; i < MLKEM_N; i += 2 * vl)
  {
    vint32m2_t acc0 = __riscv_vmv_v_x_i32m2(0, vl);
    vint32m2_t acc1 = __riscv_vmv_v_x_i32m2(0, vl);
    for (j = 0; j < kn; j += MLKEM_N)
    {
      vint16m1x2_t x = __riscv_vlseg2e16_v_i16m1x2(&a[i + j], vl);
      vint16m1x2_t y = __riscv_vlseg2e16_v_i16m1x2(&b[i + j], vl);
      vint16m1_t x0 = __riscv_vget_v_i16m1x2_i16m1(x, 0);
      vint16m1_t x1 = __riscv_vget_v_i16m1x2_i16m1(x, 1);
      vint16m1_t y0 = __riscv_vget_v_i16m1x2_i16m1(y, 0);
      vint16m1_t y1 = __riscv_vget_v_i16m1x2_i16m1(y, 1);
      vint16m1_t yt = __riscv_vle16_v_i16m1(&bc[(i + j) / 2], vl);
      vint32m2_t x0y0 = __riscv_vwmul_vv_i32m2(x0, y0, vl);
      vint32m2_t x0y1 = __riscv_vwmul_vv_i32m2(x0, y1, vl);
      vint32m2_t x1y0 = __riscv_vwmul_vv_i32m2(x1, y0, vl);
      vint32m2_t x1yt = __riscv_vwmul_vv_i32m2(x1, yt, vl);
      acc0 = __riscv_vadd_vv_i32m2(acc0, x0y0, vl);
      acc0 = __riscv_vadd_vv_i32m2(acc0, x1yt, vl);
      acc1 = __riscv_vadd_vv_i32m2(acc1, x0y1, vl);
      acc1 = __riscv_vadd_vv_i32m2(acc1, x1y0, vl);
    }
    __riscv_vsseg2e16_v_i16m1x2(
        &r[i],
        __riscv_vcreate_v_i16m1x2(fq_redc2(acc0, vl), fq_redc2(acc1, vl)), vl);
  }
}
```

But this is notably slower. Interestingly, even ignoring the mulcache computation itself (which is almost twice as slow as
@mjosaarinen What can I learn from this? My guess is that
Signed-off-by: Hanno Becker <[email protected]>
Three minor comments/questions, but none of them is blocking.
Happy to go ahead and merge this PR and revisit it in a follow-up if needed.
Thank you @mjosaarinen for contributing and thank you @hanno-becker for all the work you have put into this!
…sics.)
Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
- Don't skip over 'opt' tests in CI when testing RV64
- Configure qemu to use 256-bit VL. Once we support other VLs, the invocation of qemu needs to be generalized accordingly.

Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
This commit makes minor changes to the RV64 backend to make the non-NTT component VLEN-agnostic. This mostly means using __riscv_vsetvl_e16m1 instead of a hardcoded VLEN, plus some added tail handling in rejection sampling. For the NTT and invNTT, the implementation is specific to VLEN=256 and falls back to the C implementation for other vector lengths. This should be improved in the future. The compile-time guard for the RV64 backend is changed to check for RVV support only, without a specific VLEN. Similarly, the default build flags use 128 rather than 256 as the minimum VLEN. The README is adjusted accordingly. We also remove the warning regarding the code being experimental, as it has by now undergone thorough review and testing. Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
The backend API requires a bound of 8 * MLKEM_Q on the absolute value of the output coefficients of the NTT. This bound is already satisfied by the current implementation (which is aligned with the AArch64 and x86_64 backends) and hence no final reduction is needed. Signed-off-by: Hanno Becker <[email protected]>
A plain poly add/sub is not part of the backend API because we trust autovectorization to handle it. The corresponding functions mlk_rv64v_poly_add and mlk_rv64v_poly_sub are therefore dead code and removed from the backend. Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Use high subtraction rather than addition in the Montgomery reduction to avoid the possibility of a carry-in from the low half. cf https://eprint.iacr.org/2018/039. Signed-off-by: Hanno Becker <[email protected]>
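The subtraction-based Montgomery reduction referenced here can be sketched in scalar C. The constants match the ML-KEM reference code, but this is the widely used reference pattern rather than necessarily the exact backend code; the high half of `a - t*q` is exact because the low 16 bits cancel by construction of `t`, so no carry can propagate in:

```c
#include <stdint.h>
#include <assert.h>

#define MLKEM_Q 3329
#define QINV 62209 /* q^-1 mod 2^16 */

/* Signed Montgomery reduction: for |a| < 2^15 * q, returns r with
 * r == a * 2^-16 (mod q) and |r| < q. Because t == a * q^-1 mod 2^16,
 * the value a - t*q is divisible by 2^16, so the shift is exact. */
static int16_t montgomery_reduce(int32_t a)
{
  int16_t t = (int16_t)((int16_t)a * QINV); /* low-half product */
  return (int16_t)((a - (int32_t)t * MLKEM_Q) >> 16);
}
```

The "high subtraction" formulation computes exactly this high half directly, avoiding the need to materialize (and carry out of) the low half at all.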
Previously, the invNTT would keep the coefficients in the unsigned canonical range [0,MLKEM_Q). This commit changes this to a lazy reduction strategy following the AArch64 and x86_64 backends. Bounds are tracked coarsely and reductions are introduced where necessary. The reduction is certainly not optimal, but further improvements have diminishing returns while complicating review. Bounds assertions (debug only) are introduced to check the bounds at runtime. Signed-off-by: Hanno Becker <[email protected]>
For VLEN >= 512, there are tail iterations in the rejection handling loop where we require fewer coefficients than fit into a vector, requiring an adjustment of the dynamic VL. The previous code re-evaluated the dynamic VL in every iteration, which incurred a significant runtime cost. This commit instead splits the rejection sampling loop into two nested loops, where the inner loop proceeds with a fixed VL and the outer loop re-evaluates the VL. For VLEN <= 256, there is only one iteration of the outer loop, rendering it as efficient as the original version. Signed-off-by: Hanno Becker <[email protected]>
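The loop split described in this commit can be modelled in plain scalar C. `VLMAX` and all names below are illustrative stand-ins for the hardware vector length and the backend's identifiers, not the actual code:

```c
#include <stddef.h>
#include <assert.h>

#define VLMAX 16 /* stand-in for the hardware vector length */

/* Process n elements. The inner loop runs at a fixed chunk size vl,
 * and the outer loop recomputes vl only when a tail remains; this
 * mirrors hoisting the vsetvl out of the hot loop. Returns how often
 * the chunk size had to be recomputed. */
static size_t process_split(size_t n)
{
  size_t done = 0, recomputes = 0;
  while (done < n)
  {
    size_t vl = (n - done < VLMAX) ? (n - done) : VLMAX; /* "vsetvl" */
    recomputes++;
    while (n - done >= vl) /* fixed-vl inner loop */
    {
      /* ... consume vl elements here ... */
      done += vl;
    }
  }
  return recomputes;
}
```

When n is a multiple of the full chunk size, the outer loop runs exactly once, matching the claim that for VLEN <= 256 the split loop is as efficient as the original.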
Signed-off-by: Hanno Becker <[email protected]>
This is required by `-Wconversion`, and is indeed worth a comment since a cast from `unsigned` to `int` can in theory overflow. Also, unify unsigned types used in `rej_uniform` to `unsigned`, and add explicit casts where the RVV intrinsics return `unsigned long`. Signed-off-by: Hanno Becker <[email protected]>
Many thanks @mjosaarinen, this is a great addition. When do you think you'll get to writing a Keccak backend, and adding [inv]NTTs for other VLENs?
Summary:
rv64v support (RISC-V vector extension 1.0, which is available on newer application-class silicon).

Do you expect this change to impact performance: Yes/No
Yes (RISC-V only).

If yes, please provide local benchmarking results.
Roughly 2.5x perf on silicon with vector hardware.